Data Science Nanodegree

Capstone Project

Darius Murawski

30.07.2021

I. Definition

Project Overview

Problem Domain

We will be talking about job salaries, with a focus on coding-oriented jobs. A salary depends on several factors that have to be taken into account:

and even more. All of this gets represented by a single "number" (we will be using annual USD). It is not easy to split a salary into the parts that would allow a more objective measurement and comparison with others.

Input Data overview

Thanks to Stack Overflow and their community: every year they release a survey about the tech stack respondents use, where they work from, and how much they earn. This information can be obtained directly from Stack Overflow in an anonymized fashion: stackoverflow.com.

Even historical data is available, allowing us to watch for trends and the points in time when new technologies, frameworks, or languages arise. The data can be downloaded free of charge under the Open Database License. It is allowed to:

as long as the produced result is made public again. As I am working with a public git repository, this matches the requirements.

Problem Statement

Job salary is always a big topic on business platforms such as LinkedIn and the popular XING (in German-speaking regions). For both services you have to pay a monthly subscription in order to get the information you are interested in.

Salary was and always will be a strong argument for searching for or changing a job. On the other hand, not many people want to share this information, as they prefer to receive information before sharing their own salary in exchange.

I want to provide an easy-to-use web application that can predict a salary based on the top 15 features. The application takes the features as input via an HTML form and returns the prediction without making any further API calls:

All potentially private information stays in the browser, so the user does not have to worry about their data. I also want to keep the memory requirements for training low, to allow local training on practically any modern device with 4 GB of RAM. We will be using TensorFlow (the Keras API) for training and TensorFlow.js for the prediction, embedded in a static HTML5 web app.

Solution Statement

When data is available in a public space, this information should be as easy as possible for everyone to access. I want to make the data easily available for everyone, so we will build an app where each feature can be passed in and we then get a prediction that we can compare to the salary we or others earn.

Metrics

As we are dealing with numeric predictions within a range, this is a classic regression problem. We will be using the Root Mean Squared Error (RMSE), a common metric for this kind of problem. Why RMSE:
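To make the metric concrete, here is a minimal sketch of how RMSE is computed (illustrative only, not the project's actual evaluation code):

```python
import numpy as np

def rmse(y_true, y_pred):
    # RMSE penalizes large errors quadratically and stays in the unit
    # of the target (here: the scaled annual USD salary).
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

print(rmse([100, 200, 300], [110, 190, 310]))  # 10.0
```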

II. Analysis

Data Exploration

The Stack Overflow data contains a lot of columns (61 in total). Most columns are strings drawn from a set of possible answers; only a small number contain floating point values. In total, we have 64,461 responses, of which about 53.9% (34,756) contain an answer related to the respondent's current job salary, fewer than for the job satisfaction question we analysed at the beginning of this year (70%).

Respondent

Anonymized identifier of a survey response. Unique for each row in the dataset, numeric. Not helpful for our use case, so it can be removed.

MainBranch

An exclusive answer stating whether developing is the respondent's main job. Possible values are:

Most work as professional developers.

Hobbyist

Whether they code as a hobby.

Most code in their free time as well; some don't give an answer.

Age

The age of the respondent at survey time.

There seems to be no validation of the input data for the Age column - we need to remove such answers in the following steps, as I consider an age of 99 implausible for a professional developer. We have NaNs here as well.

Age1stCode

The age the respondent had when they started to code.

We have NaNs here again. Besides numeric values, two textual options could also have been selected: Older than 85 and Younger than 5 years.
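A hedged sketch of how such a mixed column could be converted to numeric values (the replacement ages 4 and 86 are my own assumption, not taken from the project):

```python
import pandas as pd

# Illustrative values; the real survey strings may differ slightly.
raw = pd.Series(["14", "Younger than 5 years", "30", "Older than 85", None])

# Map the two textual options to assumed numeric stand-ins, keep NaN as NaN.
cleaned = pd.to_numeric(
    raw.replace({"Younger than 5 years": "4", "Older than 85": "86"}),
    errors="coerce",
)
print(cleaned.tolist())  # [14.0, 4.0, 30.0, 86.0, nan]
```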

CompFreq

How often did they receive their salary? In Germany a monthly paid salary is common, but the contract states an annual salary. In other parts of the world this looks different.

Yearly is the most popular. We have NaNs here as well.

Country

Where the respondent is living.

Most answers were from the USA (almost 20%).

CurrencyDesc and CurrencySymbol

The currency used in the respondent's country. Stack Overflow used this to convert the salary into USD based on the exchange rates of a particular day.

DatabaseDesireNextYear and DatabaseWorkedWith

Databases that are used now or are desired for the next year.

This is the first non-exclusive answer: from a given list of options, the user could select multiple. This is encoded in the source file as semicolon-separated values within the field. This can create a unique combination for each user, so it has to be converted in the following steps to reduce the feature space.
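As a small illustration (database names invented, not from the dataset), pandas can split such semicolon-separated answers into one-hot columns:

```python
import pandas as pd

# Hypothetical multi-select answers encoded as semicolon-separated values.
answers = pd.Series(["MySQL;PostgreSQL", "PostgreSQL", None])

# str.get_dummies splits on the separator and one-hot encodes each option;
# a missing answer becomes an all-zero row.
onehot = answers.str.get_dummies(sep=";")
print(onehot.columns.tolist())  # ['MySQL', 'PostgreSQL']
```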

DevType

Type of developer the respondent is. Multiple answers allowed.

EdLevel

Highest education level the respondent has - only one answer allowed.

Almost half of the users have a Bachelor's degree.

Employment

What their current employment status is.

Most are employed on a full-time basis.

Ethnicity

Ethnic background.

Multiple options could be selected. Most (64%) are White or of European descent.

Gender

Almost all participants were men (91%).

JobFactors

What are the important factors for choosing a job?

Multiple options were available. The top answer combination, with 2,777 responses:

JobSat

Job Satisfaction. Only one option could be used.

Most are very satisfied with their job (32%).

JobSeek

Whether they are currently searching for a new job.

Most are open to new opportunities but are not actively looking (58%).

LanguageDesireNextYear and LanguageWorkedWith

Programming languages the respondent works with or wants to work with. Multiple answers possible.

The top combination here is HTML, CSS, JavaScript, PHP and SQL. The top value for next year is Python.

MiscTechDesireNextYear and MiscTechWorkedWith

Different tools the respondent works with or wants to work with. Multiple answers possible.

Most have used Node.js as a tool or want to start using it in the next year.

NEWCollabToolsDesireNextYear and NEWCollabToolsWorkedWith

List of collaboration tools; the most common response is the GitHub platform.

NEWDevOps

Does your company have a dedicated DevOps person?

Most have a DevOps team.

The following columns are less important, but for completeness I list them here. The descriptions were extracted from the Stack Overflow zip itself.

NEWDevOpsImpt

How important is the practice of DevOps to scaling software development?

NEWEdImpt

How important is a formal education, such as a university degree in computer science, to your career?

NEWJobHunt

In general, what drives you to look for a new job? Select all that apply.

NEWJobHuntResearch

When job searching, how do you learn more about a company? Select all that apply.

NEWLearn

How frequently do you learn a new language or framework?

NEWOffTopic

Do you think Stack Overflow should relax restrictions on what is considered “off-topic”?

NEWOnboardGood

Do you think your company has a good onboarding process? (By onboarding, we mean the structured process of getting you settled in to your new role at a company)

NEWOtherComms

Are you a member of any other online developer communities?

NEWOvertime

How often do you work overtime or beyond the formal time expectation of your job?

NEWPurchaseResearch

When buying a new tool or software, how do you discover and research available solutions? Select all that apply

You search for a coding solution online and the first result link is purple because you already visited it. How do you feel?

NEWSOSites

Which of the following Stack Overflow sites have you visited? Select all that apply.

NEWStuck

What do you do when you get stuck on a problem? Select all that apply.

OpSys

What is the primary operating system in which you work?

OrgSize

Approximately how many people are employed by the company or organization you currently work for?

PlatformDesireNextYear and PlatformWorkedWith

Which platforms have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the platform and want to continue to do so, please check both boxes in that row.)

PurchaseWhat

What level of influence do you, personally, have over new technology purchases at your organization?

Sexuality

Which of the following describe you, if any? Please check all that apply. If you prefer not to answer, you may leave this question blank.

SOAccount

Do you have a Stack Overflow account?

SOComm

Do you consider yourself a member of the Stack Overflow community?

SOPartFreq

How frequently would you say you participate in Q&A on Stack Overflow? By participate we mean ask, answer, vote for, or comment on questions.

SOVisitFreq

How frequently would you say you visit Stack Overflow?

SurveyEase

How easy or difficult was this survey to complete?

SurveyLength

How do you feel about the length of the survey this year?

Trans

Are you transgender?

UndergradMajor

What was your primary field of study?

WebframeDesireNextYear and WebframeWorkedWith

Which web frameworks have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the framework and want to continue to do so, please check both boxes in that row.)

WelcomeChange

Compared to last year, how welcome do you feel on Stack Overflow?

WorkWeekHrs

On average, how many hours per week do you work? Please enter a whole number in the box

YearsCode

Including any education, how many years have you been coding in total?

Most have 10 years of experience.

YearsCodePro

NOT including education, how many years have you coded professionally (as a part of your work)?

Most have 3 years of professional coding experience.

Data Visualization

In this chapter we will look a bit more closely at the distribution of the values.

Age

The age distribution looks surprising: on the one hand we have a heavy outlier with a value of 279, which is impossible. But all values higher than 65 seem a bit unrealistic as well, because at that age you may retire and stop "coding".

I would consider removing all outliers beyond 3 times the standard deviation in a preprocessing step. This applies to the really low values as well.
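The 3-sigma rejection described above could look like this (a sketch with made-up ages, not the project's actual code):

```python
import numpy as np

# Made-up sample: many plausible ages plus the impossible outlier 279.
ages = np.array([25] * 50 + [30] * 50 + [279], dtype=float)

# Keep only values within 3 standard deviations of the mean;
# on real data this also drops implausibly low values.
mask = np.abs(ages - ages.mean()) <= 3 * ages.std()
print(int(mask.sum()))  # 100 rows survive, the outlier is gone
```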

YearsCode

We have to remove / convert the already mentioned categorical values from the data in order to plot it.

This looks realistic. It is interesting that many picked a round number (10, 20, 30, 40) - maybe they don't remember more precisely.

YearsCodePro

With a peak at 3 years of professional coding experience, it looks like mostly junior developers are visiting Stack Overflow.

III. Methodology

Data Preprocessing

Several preprocessing steps were implemented. We will go into detail here about what each one does and why it was required. They are explained in order of execution. Splitting the preprocessing into several files allowed me to isolate problems while coding and focus on solving one task per step. Most columns can be NaN - that needed a lot of attention.

File: preprocessing.py

This was the first preprocessing step. It explodes the list of possible answers into one-hot encoded columns for non-numeric answers. For example, the column LanguageWorkedWith allowed multiple answers: when we found a semicolon, we extracted these values into their own columns. This resulted in a very wide table; I was not able to train the model on this representation (even loading it in pandas failed). NaNs were encoded as their own column.
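A hedged sketch of this explode step, including the NaN-as-own-column encoding (the toy data and language names are illustrative, not the project's exact code):

```python
import pandas as pd

# Toy frame; the real table has 61 columns and 64,461 rows.
df = pd.DataFrame({"LanguageWorkedWith": ["Python;SQL", "Python", None]})

col = "LanguageWorkedWith"
onehot = df[col].str.get_dummies(sep=";").add_prefix(col + "_")
onehot[col + "_nan"] = df[col].isna().astype(int)  # NaN gets its own column

result = pd.concat([df.drop(columns=[col]), onehot], axis=1)
print(result.columns.tolist())
# ['LanguageWorkedWith_Python', 'LanguageWorkedWith_SQL', 'LanguageWorkedWith_nan']
```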

File: minmax.py

After I had completed all steps once, the loss exploded to inf instead of decreasing. This was fixed by adding scaling to all columns (in particular the numeric ones), including the target value. This executable calculates the min and max for each column. But it is not the absolute min and max; instead it is the value I would consider the lowest possible (3 * stddev from the mean) or the highest possible. After exploding, a column can contain only categorical 0/1 values; this case is handled specially so it does not mess up the following steps.
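A sketch of the per-column bounds calculation with the special case for exploded 0/1 columns (the function name and details are my assumption of what minmax.py does, not its actual code):

```python
import numpy as np

def column_bounds(values):
    # Assumed sketch: bounds are mean +/- 3 * stddev, except for
    # exploded one-hot columns, which keep the fixed range [0, 1].
    values = np.asarray(values, dtype=float)
    finite = values[~np.isnan(values)]
    if set(np.unique(finite)) <= {0.0, 1.0}:
        return 0.0, 1.0
    mean, std = finite.mean(), finite.std()
    return mean - 3 * std, mean + 3 * std

print(column_bounds([0, 1, 1, 0]))  # (0.0, 1.0)
```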

File: reject.py

This step reads the data again and "marks" all Respondent ids that should be removed from the dataset because they violate the min/max constraints.

File: scale.py

Now we finally scale the data by dividing by the previously calculated max value (we ignore the min value). This step skips the rows marked by reject.py, indirectly removing them from the dataset.
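Combined, reject.py and scale.py behave roughly like this sketch (the ids, bound, and salaries are invented for illustration):

```python
import pandas as pd

# Invented toy data; the real pipeline reads the survey CSV.
df = pd.DataFrame({"Respondent": [1, 2, 3], "Salary": [50_000, 80_000, 900_000]})
max_salary = 100_000   # upper bound from minmax.py (assumed value)
rejected = {3}         # Respondent ids marked by reject.py (assumed)

kept = df[~df["Respondent"].isin(rejected)].copy()
kept["Salary"] = kept["Salary"] / max_salary  # divide by max, ignore min
print(kept["Salary"].tolist())  # [0.5, 0.8]
```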

Implementation

File: train_features.py

After having implemented the full preprocessing pipeline, I was still not able to train on the data within the memory restrictions. That is why I cheated a bit and trained on what I call a feature group: for each column, I put the target label and its one-hot encoded columns into one file and trained on it in Keras.

As this was a lot less heavy (in the number of columns), it worked very well. The model itself is as simple as possible. As I don't have that many samples, adding more layers would not improve it much, but it would increase the inference time on the prediction side later - performance wins for my use case. The relevant part from shared/train.py:

def build_and_compile_model():
    # A single Dense(1) layer without activation: effectively a linear regression.
    model = keras.Sequential([
        layers.Dense(1)
    ])

    model.compile(loss=tf.keras.losses.MeanSquaredError(),
                  optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  metrics=[tf.keras.metrics.MeanSquaredError(),
                           tf.keras.metrics.MeanAbsoluteError()])
    return model

and the fit call itself:

model = build_and_compile_model()

history = model.fit(X_train.values, y_train.values,
                    verbose=1,
                    epochs=epochs, batch_size=1,
                    validation_data=(X_test, y_test))
return model, history, list(df.columns)

Refinement

File: select_features.py

After running all models, I had to pick the best ones. I decided to use the best 15 models together, each of which was trained on only one feature group. This number is (in my view) a good ratio between the time needed to enter the data in the prediction case and the quality of the model. A second positive aspect: it makes sure everything fits into memory. The 15 models were selected by their RMSE score: the 15 lowest ones were used in train_all.py.
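The selection itself boils down to sorting by validation RMSE; a minimal sketch with invented scores (not real results from the project):

```python
# Invented per-feature-group validation RMSE scores, for illustration only.
scores = {"Age": 0.09, "Country": 0.04, "EdLevel": 0.07, "DevType": 0.12}

top_n = 2  # the project keeps the best 15; 2 keeps this example short
best = sorted(scores, key=scores.get)[:top_n]
print(best)  # ['Country', 'EdLevel']
```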

File: train_all.py

Finally, the best selected features are combined, creating the final model. We do this by splitting the data into 5 folds and training on each fold with a classic 80/20 split. The model with the lowest RMSE in any epoch is selected as the winner. All models share a low RMSE score except the first fold; we may have been unlucky with the data or the test set (extracted from stdout):

Fold | val RMSE epoch 1 | val RMSE epoch 2 | val RMSE epoch 3
---- | ---------------- | ---------------- | ----------------
0    | 125.1276         | 132.7396         | 135.8756
1    | 0.000698         | 0.000639         | 0.000556
2    | 0.009238         | 0.008936         | 0.008454
3    | 0.000386         | 0.000255         | 0.000125
4    | 0.003604         | 0.003343         | 0.002951
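The fold loop can be sketched as follows (plain NumPy, no Keras; this is my reading of the scheme described above, not the code from train_all.py):

```python
import numpy as np

# Toy data standing in for the combined best-feature table.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
indices = np.arange(len(X))

folds = np.array_split(indices, 5)  # 5 folds of 20 rows each
for test_idx in folds:
    train_idx = np.setdiff1d(indices, test_idx)  # remaining 80% for training
    # here: fit the model on X[train_idx], track val RMSE on X[test_idx]
    assert len(train_idx) == 80 and len(test_idx) == 20
print(len(folds))  # 5
```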

Our best epoch got an RMSE score of 0.0001 - really good! Because of these good results I did not do any further hyperparameter search.

File: template.py

Now only the serving part is missing. I decided to embed this Keras model in a static website using TensorFlow.js. This executable generates the entire HTML template, which can then be opened and used locally, without any further API calls, preventing any data from flowing out of the browser.

Results

Model Evaluation and Validation

For all intermediate models the validation RMSE was used. The lower, the better. We did a classic 80/20 split with no hyperparameter tuning due to the simplicity of the model. Here are the results:

As you can see from the output, none of the individual models was able to outperform the combined one, which has a val_mean_squared_error an order of magnitude lower. Combining them gives us very good performance and robust results.

Justification

Let's compare these results with the final model:

After one epoch we can see that the MSE is 0.010; it increased to 0.011 and then got back to 0.010. Our model did not improve after the 3rd epoch, so I cancelled the calculation.

The model did very well in predicting the job salary.

Conclusion

Reflection

We performed several preprocessing steps to get the data into a format that can be "understood" by ML: exploding the columns, rejecting outliers, and scaling the data. To reduce memory usage we trained on one feature group at a time and combined the best 15 models. They are served in the (local) browser to provide a safe environment for predictions.

Job salary is always a hot topic. The more transparently we handle it, the better for everybody involved. A neutral source and this model can help to get a better feeling for the salary. The most challenging part was the inf loss during training before scaling was implemented; it took me quite a long time to figure out.

Improvement

In the data fetching step I added other survey years as well (besides 2020), but did not use them in my solution; it may be worth trying to merge the different survey sources together to have more data points available. As the solution is highly automated, more manual work could be added to validate the column values - for example, the max salary after scaling is still very high (500k USD), which is in my opinion out of bounds for most developers I know. Maybe the US market is different.

Deliverables

Application

If you want to run the entire pipeline locally, just use the all.sh script in the project root directory. For more details on installation, please see README.md. This will do everything you need (even download the data); finally, you can serve the generated HTML template using the built-in Python web server (as it is static HTML/JS) - the server is started by the script as well. It takes about 30 minutes to run everything.

Github Repository

You can get the code from the following GitHub repository: https://github.com/dariusgm/stackoverflowcapstone